
in this case, can either lead to no change in the structure and function of the protein (conservative mutation) or have a deleterious consequence if the change alters the protein structure and function. A base substitution may also have a nonsense consequence if it results in a stop codon that truncates the translated protein, leaving an incomplete and nonfunctional protein.
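The possible outcomes of a base substitution can be illustrated with a short Python sketch. It assumes the standard genetic code and looks at a single codon in isolation. The function name `classify_substitution` and the labels "silent", "missense", and "nonsense" are illustrative choices, not terms defined in this chapter. Whether a missense change is actually deleterious depends on the protein and cannot be decided from the codon alone.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon: str, position: int, new_base: str) -> str:
    """Classify a single-base substitution within one codon.

    `position` is 0-based within the codon. The sketch assumes the
    original codon encodes an amino acid (it is not a stop codon).
    """
    codon = codon.upper()
    mutated = codon[:position] + new_base.upper() + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if after == before:
        return "silent"      # same amino acid, protein unchanged
    if after == "*":
        return "nonsense"    # premature stop codon, truncated protein
    return "missense"        # different amino acid


print(classify_substitution("GAA", 2, "G"))  # GAA (Glu) -> GAG (Glu)
print(classify_substitution("TAC", 2, "A"))  # TAC (Tyr) -> TAA (stop)
print(classify_substitution("GAG", 1, "T"))  # GAG (Glu) -> GTG (Val)
```

The three calls cover a change that leaves the amino acid untouched, one that introduces a stop codon, and one that swaps one amino acid for another.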

A deletion mutation is the removal of one or more nucleotide pairs from a gene, which may result in a frameshift, a garbled message, and a nonfunctional product. A deletion may or may not be deleterious, depending on the region it alters and its impact on the protein sequence. An insertion mutation is the addition of extra base pairs; in a coding sequence it causes a frameshift unless the number of inserted base pairs is a multiple of three. Mutations may also combine insertions and deletions, leading to a variety of outcomes.
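The multiple-of-three rule can be checked with simple arithmetic. The following is a minimal sketch, assuming the variant lies entirely within a coding sequence and is given as reference and alternate allele strings. The function name `indel_effect` is hypothetical.

```python
def indel_effect(ref: str, alt: str) -> str:
    """Report whether an insertion or deletion preserves the reading frame."""
    length_change = len(alt) - len(ref)
    if length_change == 0:
        return "no length change"
    kind = "insertion" if length_change > 0 else "deletion"
    # The reading frame survives only if the net change is a multiple of 3.
    frame = "in-frame" if length_change % 3 == 0 else "frameshift"
    return f"{frame} {kind}"


print(indel_effect("A", "ATTT"))  # three bases inserted
print(indel_effect("ACG", "A"))   # two bases deleted
```

An in-frame event adds or removes whole codons, so the downstream amino acids are still read correctly. A frameshift changes every codon after the event.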

In general, a gene variant is a permanent change in the nucleotide sequence of a gene. Variants can be either germline variants, which occur in the eggs and sperm of parents and are passed to offspring, or somatic variants, which are present only in specific cells and are generally not hereditary.

In terms of sequence change, variants can be classified into single-nucleotide variants (SNVs), insertion–deletions (InDels), or structural variations (SVs). An SNV is the substitution of a single nucleotide for another. It is known as a single-nucleotide polymorphism (SNP) if its allelic frequency in a population is more than 1%. An InDel is an insertion and/or deletion of nucleotides in genomic DNA, and it includes events less than 1000 nucleotides in length. InDels are implicated as the driving mechanism underlying many diseases. An SV involves a change of more than 50 base pairs in the sequence of a gene; the change may include rearrangement of part of the genome, a deletion, duplication, insertion, inversion, translocation, or a combination of these. A copy number variation (CNV) is a duplication or deletion that changes the number of copies of a particular DNA segment within the genome. SVs have been implicated in a number of health conditions.
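The size-based categories above can be expressed as a small classifier. This is only a sketch: the size ranges quoted for InDels and SVs overlap, so the cutoff is exposed as a parameter (`sv_min_length`, defaulting to the 50 base pairs mentioned for SVs) rather than fixed. The function names and the "multi-base substitution" label are illustrative assumptions.

```python
def classify_variant(ref: str, alt: str, sv_min_length: int = 50) -> str:
    """Assign a variant to a broad class from its allele lengths alone."""
    if ref == alt:
        raise ValueError("REF and ALT are identical; there is no variant")
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    size = abs(len(alt) - len(ref))
    if size == 0:
        return "multi-base substitution"
    # Events longer than the cutoff are treated as structural variation.
    return "SV" if size > sv_min_length else "InDel"


def is_snp(allele_frequency: float, threshold: float = 0.01) -> bool:
    """An SNV counts as an SNP when its population frequency exceeds 1%."""
    return allele_frequency > threshold


print(classify_variant("A", "G"))             # single-base substitution
print(classify_variant("A", "ATG"))           # two bases inserted
print(classify_variant("A", "A" + "T" * 60))  # 60 bases inserted
print(is_snp(0.05), is_snp(0.001))
```

Rearrangements such as inversions and translocations cannot be recognized from allele lengths alone, so a length-based rule like this covers only deletions, insertions, and duplications.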

In this chapter, we will learn about the major steps in the process of variant identification and analysis, including variant representation, the variant calling workflow, and variant annotation. The process by which we identify variants from sequence data (reads) is called variant calling, which is the central topic of this chapter.

4.1.1  VCF File Format

Since a variant is a change at a specific location in a genome, bioinformatics requires a format that can describe the type of a mutation and its position relative to the genome coordinates. Thus, the variant call format (VCF) file [2] was developed to hold information on a large number of variants and also to hold the genotype information of multiple samples at the same position. The VCF file, as shown in Figure 4.1, consists of (i) a metadata section for the meta-information and (ii) a data section for variant data. The VCF file has become the standard file for storing variant information for almost all variant calling programs.
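The two-section layout can be seen by splitting a VCF file line by line. The sketch below relies on the VCF convention that metadata lines start with "##" and that a single column header line starting with "#CHROM" introduces the tab-separated data records. The function name `split_vcf` and the example record are made up for illustration and do not come from Figure 4.1.

```python
def split_vcf(lines):
    """Separate VCF lines into metadata, column header, and data records."""
    metadata, header, records = [], None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            metadata.append(line)                    # meta-information line
        elif line.startswith("#"):
            header = line.lstrip("#").split("\t")    # column names
        else:
            records.append(line.split("\t"))         # one variant per line
    return metadata, header, records


# A tiny, hypothetical VCF with one variant and no sample columns.
example = [
    "##fileformat=VCFv4.2",
    "##source=hypothetical_caller",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.",
]
metadata, header, records = split_vcf(example)
print(len(metadata), "metadata lines")
print(dict(zip(header, records[0])))
```

For real files, a dedicated VCF library is usually a safer choice than manual splitting, since it also handles compression, multiple samples, and the structured INFO and FORMAT fields.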

Each line in the metadata section of a VCF file begins with “##”. The metadata lines

describe the format and content of a VCF file. This can include information about the